{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating names with recurrent neural networks (5 points)\n",
    "\n",
    "This time you'll find yourself delving into the heart (and other intestines) of recurrent neural networks on a class of toy problems.\n",
    "\n",
    "Struggle to find a name for the variable? Let's see how you'll come up with a name for your son/daughter. Surely no human has expertize over what is a good child name, so let us train RNN instead;\n",
    "\n",
    "It's dangerous to go alone, take these:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np",
    "\n",
    "import theano",
    "\n",
    "import theano.tensor as T",
    "\n",
    "import lasagne",
    "\n",
    "import os",
    "\n",
    "#thanks @keskarnitish"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_token = \" \"",
    "\n",
    "\n",
    "with open(\"names\") as f:",
    "\n",
    "    names = f.read()[:-1].split('\\n')",
    "\n",
    "    names = [start_token+name for name in names]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('n samples = ', len(names))",
    "\n",
    "for x in names[::1000]:",
    "\n",
    "    print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text processing\n",
    "\n",
    "First we need next to collect a \"vocabulary\" of all unique tokens i.e. unique characters. We can then encode inputs as a sequence of character ids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# all unique characters go here",
    "\n",
    "tokens = <all unique characters in the dataset >",
    "\n",
    "\n",
    "tokens = list(tokens)",
    "\n",
    "print('n_tokens = ', len(tokens))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Theano is built for numbers, not strings of characters.\n",
    "We'll feed our recurrent neural network with ids of characters from our dictionary.\n",
    "\n",
    "To create such dictionary, let's assign each character with it's index in tokens list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "token_to_id = <dictionary of symbol -> its identifier(index in tokens list) >",
    "\n",
    "\n",
    "id_to_token = < dictionary of symbol identifier -> symbol itself >"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt",
    "\n",
    "%matplotlib inline",
    "\n",
    "plt.hist(list(map(len, names)), bins=25)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# truncate names longer than MAX_LEN characters.",
    "\n",
    "MAX_LEN = ?!",
    "\n",
    "\n",
    "# you will likely need to change this for any dataset different from \"names\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cast everything from symbols into identifiers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "names_ix = list(map(lambda name: list(map(token_to_id.get, name)), names))",
    "\n",
    "\n",
    "\n",
    "# crop long names and pad short ones",
    "\n",
    "for i in range(len(names_ix)):",
    "\n",
    "    names_ix[i] = names_ix[i][:MAX_LEN]  # crop too long",
    "\n",
    "\n",
    "    if len(names_ix[i]) < MAX_LEN:",
    "\n",
    "        names_ix[i] += [token_to_id[\" \"]] * \\",
    "\n",
    "            (MAX_LEN - len(names_ix[i]))  # pad too short",
    "\n",
    "\n",
    "assert len(set(map(len, names_ix))) == 1",
    "\n",
    "\n",
    "names_ix = np.array(names_ix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Input variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_sequence = T.matrix('token sequencea', 'int32')",
    "\n",
    "target_values = T.matrix('actual next token', 'int32')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build NN\n",
    "\n",
    "You will be building a model that takes token sequence and predicts next token\n",
    "\n",
    "\n",
    "* iput sequence\n",
    "* one-hot / embedding\n",
    "* recurrent layer(s)\n",
    "* otput layer(s) that predict output probabilities\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lasagne.layers import InputLayer, DenseLayer, EmbeddingLayer",
    "\n",
    "from lasagne.layers import RecurrentLayer, LSTMLayer, GRULayer, CustomRecurrentLayer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "l_in = lasagne.layers.InputLayer(shape=(None, None), input_var=input_sequence)",
    "\n",
    "\n",
    "#!<Your neural network>",
    "\n",
    "l_emb = <embedding layer or one-hot encoding >",
    "\n",
    "\n",
    "l_rnn = <some recurrent layer(or several such layers) >",
    "\n",
    "\n",
    "# flatten batch and time to be compatible with feedforward layers (will un-flatten later)",
    "\n",
    "l_rnn_flat = lasagne.layers.reshape(l_rnn, (-1, l_rnn.output_shape[-1]))",
    "\n",
    "\n",
    "l_out = <last dense layer(or several layers), returning probabilities for all possible next tokens >"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Model weights",
    "\n",
    "weights = lasagne.layers.get_all_params(l_out, trainable=True)",
    "\n",
    "print(weights)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "network_output = <NN output via lasagne >",
    "\n",
    "# If you use dropout do not forget to create deterministic version for evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "predicted_probabilities_flat = network_output",
    "\n",
    "correct_answers_flat = target_values.ravel()",
    "\n",
    "\n",
    "\n",
    "loss = <loss function - a simple categorical crossentropy will do, maybe add some regularizer >",
    "\n",
    "\n",
    "updates = <your favorite optimizer >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compiling it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# training",
    "\n",
    "train = theano.function([input_sequence, target_values],",
    "\n",
    "                        loss, updates=updates, allow_input_downcast=True)",
    "\n",
    "\n",
    "# computing loss without training",
    "\n",
    "compute_cost = theano.function(",
    "\n",
    "    [input_sequence, target_values], loss, allow_input_downcast=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# generation\n",
    "\n",
    "Simple: \n",
    "* get initial context(seed), \n",
    "* predict next token probabilities,\n",
    "* sample next token, \n",
    "* add it to the context\n",
    "* repeat from step 2\n",
    "\n",
    "You'll get a more detailed info on how it works in the homework section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# compile the function that computes probabilities for next token given previous text.",
    "\n",
    "\n",
    "# reshape back into original shape",
    "\n",
    "next_word_probas = network_output.reshape(",
    "\n",
    "    (input_sequence.shape[0], input_sequence.shape[1], len(tokens)))",
    "\n",
    "# predictions for next tokens (after sequence end)",
    "\n",
    "last_word_probas = next_word_probas[:, -1]",
    "\n",
    "probs = theano.function(",
    "\n",
    "    [input_sequence], last_word_probas, allow_input_downcast=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "def generate_sample(seed_phrase=None, N=MAX_LEN, t=1, n_snippets=1):",
    "\n",
    "    '''",
    "\n",
    "    The function generates text given a phrase of length at least SEQ_LENGTH.",
    "\n",
    "\n",
    "    parameters:",
    "\n",
    "        sample_fun - max_ or proportional_sample_fun or whatever else you implemented",
    "\n",
    "\n",
    "        The phrase is set using the variable seed_phrase",
    "\n",
    "\n",
    "        The optional input \"N\" is used to set the number of characters of text to predict.     ",
    "\n",
    "    '''",
    "\n",
    "    if seed_phrase is None:",
    "\n",
    "        seed_phrase = start_token",
    "\n",
    "    if len(seed_phrase) > MAX_LEN:",
    "\n",
    "        seed_phrase = seed_phrase[-MAX_LEN:]",
    "\n",
    "    assert type(seed_phrase) is str",
    "\n",
    "\n",
    "    snippets = []",
    "\n",
    "    for _ in range(n_snippets):",
    "\n",
    "        sample_ix = []",
    "\n",
    "        x = list(map(lambda c: token_to_id.get(c, 0), seed_phrase))",
    "\n",
    "        x = np.array([x])",
    "\n",
    "\n",
    "        for i in range(N):",
    "\n",
    "            # Pick the character that got assigned the highest probability",
    "\n",
    "            p = probs(x).ravel()",
    "\n",
    "            p = p**t / np.sum(p**t)",
    "\n",
    "            ix = np.random.choice(np.arange(len(tokens)), p=p)",
    "\n",
    "            sample_ix.append(ix)",
    "\n",
    "\n",
    "            x = np.hstack((x[-MAX_LEN+1:], [[ix]]))",
    "\n",
    "\n",
    "        random_snippet = seed_phrase + \\",
    "\n",
    "            ''.join(id_to_token[ix] for ix in sample_ix)",
    "\n",
    "        snippets.append(random_snippet)",
    "\n",
    "\n",
    "    print(\"----\\n %s \\n----\" % '; '.join(snippets))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model training\n",
    "\n",
    "Here you can tweak parameters or insert your generation function\n",
    "\n",
    "\n",
    "__Once something word-like starts generating, try increasing seq_length__\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_batch(data, batch_size):",
    "\n",
    "\n",
    "    rows = data[np.random.randint(0, len(data), size=batch_size)]",
    "\n",
    "\n",
    "    return rows[:, :-1], rows[:, 1:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "print(\"Training ...\")",
    "\n",
    "\n",
    "\n",
    "# total N iterations",
    "\n",
    "n_epochs = 100",
    "\n",
    "\n",
    "# how many minibatches are there in the epoch",
    "\n",
    "batches_per_epoch = 500",
    "\n",
    "\n",
    "# how many training sequences are processed in a single function call",
    "\n",
    "batch_size = 10",
    "\n",
    "\n",
    "\n",
    "for epoch in range(n_epochs):",
    "\n",
    "\n",
    "    print(\"Generated names\")",
    "\n",
    "    generate_sample(n_snippets=10)",
    "\n",
    "\n",
    "    avg_cost = 0",
    "\n",
    "\n",
    "    for _ in range(batches_per_epoch):",
    "\n",
    "\n",
    "        x, y = sample_batch(names_ix, batch_size)",
    "\n",
    "        avg_cost += train(x, y)",
    "\n",
    "\n",
    "    print(\"Epoch {} average loss = {}\".format(",
    "\n",
    "        epoch, avg_cost / batches_per_epoch))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_sample(n_snippets=100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_sample(seed_phrase=\" A\", n_snippets=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_sample(seed_phrase= < whatever you please > , n_snippets=10, t=1.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bonus: try it out!\n",
    "You've just implemented a recurrent language model that can be tasked with generating any kind of sequence, so there's plenty of data you can try it on:\n",
    "\n",
    "* Novels/poems/songs of your favorite author\n",
    "* News titles/clickbait titles\n",
    "* Source code of Linux or Tensorflow\n",
    "* Molecules in [smiles](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) format\n",
    "* Melody in notes/chords format\n",
    "* Ikea catalog titles\n",
    "* Pokemon names\n",
    "* Cards from Magic, the Gathering / Hearthstone\n",
    "\n",
    "If you're willing to give it a try, here's what you wanna look at:\n",
    "* Current data format is a sequence of lines, so a novel can be formatted as a list of sentences. Alternatively, you can change data preprocessing altogether.\n",
    "* While some datasets are readily available, others can only be scraped from the web. Try `Selenium` or `Scrapy` for that.\n",
    "* Make sure MAX_LENGTH is adjusted for longer datasets. There's also a bonus section about dynamic RNNs at the bottom.\n",
    "* More complex tasks require larger RNN architecture, try more neurons or several layers. It would also require more training iterations.\n",
    "* Long-term dependencies in music, novels or molecules are better handled with LSTM or GRU\n",
    "\n",
    "__Good hunting!__"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
